$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$ $\newcommand{\amax}{{\text{argmax}}}$ $\newcommand{\P}{{\mathbb{P}}}$ $\newcommand{\E}{{\mathbb{E}}}$ $\newcommand{\R}{{\mathbb{R}}}$ $\newcommand{\Z}{{\mathbb{Z}}}$ $\newcommand{\N}{{\mathbb{N}}}$ $\newcommand{\C}{{\mathbb{C}}}$ $\newcommand{\abs}[1]{{ \left| #1 \right| }}$ $\newcommand{\simpl}[1]{{\Delta^{#1} }}$

%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext tfl_training_anomaly_detection
%presentation_style
%%capture

%set_random_seed 12
%load_latex_macros


import numpy as np
import itertools as it
from tqdm import tqdm

import matplotlib
from matplotlib import pyplot as plt
import plotly.express as px
import pandas as pd

import ipywidgets as widgets

from tfl_training_anomaly_detection.exercise_tools import evaluate, get_kdd_data, get_house_prices_data, create_distributions, contamination, \
perform_rkde_experiment, get_mnist_data

from ipywidgets import interact

from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import IsolationForest
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

from tfl_training_anomaly_detection.vae import VAE, build_decoder_mnist, build_encoder_minst, build_contaminated_minst

from tensorflow import keras

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (5, 5)


Anomaly Detection

Anomaly Detection via Density Estimation

Idea: Estimate the density of $F_0$. Areas of low density are anomalous.

  • Often $p$ is too small to estimate the complete mixture model
  • Takes into account that $F_1$ might not be well-defined
  • Estimation procedure needs to be robust against contamination if no clean training data is available

Kernel Density Estimation

  • Non-parametric method
  • Can represent almost arbitrarily shaped densities
  • Each training point "spreads" a fraction of the probability mass as specified by the kernel function

Definition:

  • $K: \mathbb{R} \to \mathbb{R}$ kernel function
    • $K(r) \geq 0$ for all $r\in \mathbb{R}$
    • $\int_{-\infty}^{\infty} K(r) dr = 1$
  • $h > 0$ bandwidth
  • Bandwidth is the most crucial parameter


Definition: Let $D = \{x_1,\ldots,x_N\}\subset \mathbb{R}^p$. The KDE with kernel $K$ and bandwidth $h$ is $KDE_h(x, D) = \frac{1}{N}\sum_{i=1}^N \frac{1}{h^p}K\left(\frac{|x-x_i|}{h}\right)$
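The definition can be sketched directly in NumPy, mirroring the formula term by term (a minimal illustration with a Gaussian kernel; `gaussian_kernel` and `kde` are ad-hoc names, not part of the course library):

```python
import numpy as np

def gaussian_kernel(r):
    # K(r) >= 0 and integrates to 1 over the real line
    return np.exp(-r**2 / 2) / np.sqrt(2 * np.pi)

def kde(x, D, h):
    # KDE_h(x, D) = (1/N) * sum_i (1/h^p) * K(||x - x_i|| / h)
    p = D.shape[1]
    r = np.linalg.norm(D - np.asarray(x), axis=1) / h
    return np.mean(gaussian_kernel(r) / h**p)

rng = np.random.default_rng(0)
D = rng.normal(size=(500, 2))  # standard normal sample

print(kde([0.0, 0.0], D, h=0.5))  # high density near the mode
print(kde([6.0, 6.0], D, h=0.5))  # near-zero density far in the tail
```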


Effect of bandwidth and kernel

Exercise

Play with the parameters!

dists = create_distributions(dim=2, dim_irrelevant=0)

sample_train = dists['Double Blob'].sample(500)
X_train = sample_train[-1]
y_train = [0]*len(X_train)

plt.scatter(X_train[:,0], X_train[:,1], c = 'blue', s=10)
plt.show()
# Helper function
def fit_kde(kernel: str, bandwidth: float, X_train: np.array):
    """ Fit KDE
    
    @param kernel: kernel
    @param bandwidth: bandwidth
    @param X_train: data
    """
    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth)
    kde.fit(X_train)
    return kde

def visualize_kde(kde: KernelDensity, bandwidth: float, X_test: np.array, y_test: np.array):
    """Plot KDE
    
    @param kde: KDE
    @param bandwidth: bandwidth
    @param X_test: test data
    @param y_test: test label
    """
    fig, axis = plt.subplots(figsize=(5, 5))

    lin = np.linspace(-10, 10, 50)
    grid_points = list(it.product(lin, lin))
    ys, xs = np.meshgrid(lin, lin)
    # The score function of sklearn returns log-densities
    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
    colormesh = axis.contourf(xs, ys, scores)
    fig.colorbar(colormesh)
    axis.set_title('Density Contours (Bandwidth={})'.format(bandwidth))
    axis.set_aspect('equal')
    color = ['blue' if i ==0 else 'red' for i in y_test]
    plt.scatter(X_test[:, 0], X_test[:, 1], c=color)
    plt.show()

Choose KDE Parameters

ker = None
bdw = None
@interact(
    kernel=['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'],
    bandwidth=(.1, 10.)
)
def set_kde_params(kernel: str, bandwidth: float):
    """Helper function to set widget parameters
    
    @param kernel: kernel
    @param bandwidth: bandwidth
    """
    global ker, bdw

    ker = kernel
    bdw = bandwidth
kde = fit_kde(ker, bdw, X_train)
visualize_kde(kde, bdw, X_train, y_train)

Bandwidth Selection

The bandwidth is the most important parameter of a KDE model. A wrongly adjusted value will lead to over- or under-smoothing of the density curve.

A common method to select a bandwidth is maximum log-likelihood cross validation. $$h_{\textrm{llcv}} = \arg\max_{h}\frac{1}{k}\sum_{i=1}^k\sum_{y\in D_i}\log\left(\frac{k}{N(k-1)}\sum_{x\in D_{-i}}K_h(x, y)\right)$$ where $D_{-i}$ is the data without the $i$th cross validation fold $D_i$.
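The cross-validation criterion above can be sketched as a direct search over candidate bandwidths; `llcv_bandwidth` is an ad-hoc helper written for this illustration (note that sklearn's `KernelDensity.score` returns the total log-likelihood of the passed data):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KernelDensity

def llcv_bandwidth(X, bandwidths, k=5):
    """Pick the bandwidth that maximizes the held-out log-likelihood."""
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    best_h, best_ll = None, -np.inf
    for h in bandwidths:
        # Sum of log-likelihoods over all held-out folds
        ll = sum(
            KernelDensity(kernel='gaussian', bandwidth=h).fit(X[tr]).score(X[va])
            for tr, va in cv.split(X)
        )
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
print(llcv_bandwidth(X, [0.05, 0.5, 5.0]))  # the extremes under- or over-smooth
```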

Exercises

ex no.1: Noisy sinusoidal

# Generate example
dists = create_distributions(dim=2)

distribution_with_anomalies = contamination(
    nominal=dists['Sinusoidal'],
    anomaly=dists['Blob'],
    p=0.05
)

# Train data
sample_train = dists['Sinusoidal'].sample(500)
X_train = sample_train[-1].numpy()

# Test data
sample_test = distribution_with_anomalies.sample(500)
X_test = sample_test[-1].numpy()
y_test = sample_test[0].numpy()

scatter = plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
handles, _ = scatter.legend_elements()
plt.legend(handles, ['Nominal', 'Anomaly'])
plt.gca().set_aspect('equal')
plt.show()

TODO: Define the search space for the kernel and the bandwidth

param_space = {
    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
}
def hyperopt_by_score(X_train: np.array, param_space: dict, cv: int=5):
    """Performs hyperoptimization by score
    
    @param X_train: data
    @param param_space: parameter space
    @param cv: number of cv folds
    """
    kde = KernelDensity()

    search = RandomizedSearchCV(
        estimator=kde,
        param_distributions=param_space,
        n_iter=100,
        cv=cv,
        scoring=None # use estimators internal scoring function, i.e. the log-probability of the validation set for KDE
    )

    search.fit(X_train)
    return search.best_params_, search.best_estimator_

Run the code below to perform hyperparameter optimization.

params, kde = hyperopt_by_score(X_train, param_space)

print('Best parameters:')
for key in params:
    print('{}: {}'.format(key, params[key]))

test_scores = -kde.score_samples(X_test)
test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)

curves = evaluate(y_test, test_scores)
Best parameters:
kernel: linear
bandwidth: 1.0
visualize_kde(kde, params['bandwidth'], X_test, y_test)

Exercise: Isolate anomalies in house prices

You work for a company responsible for estimating house prices around Ames, Iowa, specifically in the College Creek area. But there is a problem: houses from a nearby neighborhood, 'Veenker', are often included in your dataset. You want to build an anomaly detection algorithm that filters out every point that comes from the wrong neighborhood. You have been able to isolate an X_train dataset which, you are sure, contains only houses from the College Creek area. Following the previous example, test your ability to isolate anomalies in new incoming data (X_test) with KDE.

Advanced exercise: What happens if the contamination comes from other areas? You can choose among the following names:

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

X_train, X_test, y_test = get_house_prices_data(neighborhood = 'CollgCr', anomaly_neighborhood='Veenker')
X_train
LotArea SalePrice OverallCond
0 8461 163990 5
1 10335 204000 5
2 9548 237000 6
3 9245 145000 5
4 15523 133500 6
... ... ... ...
115 9240 287000 5
116 11317 180000 5
117 4426 141000 5
118 9066 230000 5
119 7990 110000 6

120 rows × 3 columns

# Total data
train_test_data = pd.concat([X_train, X_test], ignore_index=True)
y_total = [0] * len(X_train) + y_test

fig = px.scatter_3d(train_test_data, x='LotArea', y='OverallCond', z='SalePrice', color=y_total)

fig.show()

Solution

# When feature scales are highly inhomogeneous, like in this case, it is often beneficial
# to rescale the data before applying any anomaly detection or clustering technique.
scaler = MinMaxScaler()
X_train_rescaled = scaler.fit_transform(X_train)
param_space = {
    'kernel': ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine'], # Add available kernels
    'bandwidth': np.linspace(0.1, 10, 100), # Define Search space for bandwidth parameter
}
params, kde = hyperopt_by_score(X_train_rescaled, param_space)
print('Best parameters:')
for key in params:
    print('{}: {}'.format(key, params[key]))

X_test_rescaled = scaler.transform(X_test)
test_scores = -kde.score_samples(X_test_rescaled)
test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)
curves = evaluate(y_test, test_scores)
Best parameters:
kernel: epanechnikov
bandwidth: 0.4

The Curse of Dimensionality

The flexibility of KDE comes at a price. The dependency on the dimensionality of the data is quite unfavorable.


Theorem [Stone, 1982] Any estimator that is consistent$^*$ with the class of all $k$-fold differentiable pdfs over $\mathbb{R}^d$ has a convergence rate of at most

$$ \frac{1}{n^{\frac{k}{2k+d}}} $$

$^*$Consistency = for all pdfs $p$ in the class: $\lim_{n\to\infty}\|\hat{p}_n - p\|_\infty = 0$ with probability $1$, where $\hat{p}_n$ is the estimate from $n$ samples.
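To get a feel for how unfavorable this rate is, one can tabulate the exponent $\frac{k}{2k+d}$ and the factor by which $n$ must grow to halve the error (the smoothness $k=2$ below is an arbitrary choice for illustration):

```python
# Error decays like n^(-k/(2k+d)); to halve the error, n must grow by 2^((2k+d)/k)
k = 2  # assumed smoothness
for d in [1, 2, 10, 50]:
    exponent = k / (2 * k + d)
    growth = 2 ** (1 / exponent)
    print(f"d={d:2d}: exponent={exponent:.3f}, n must grow {growth:.0f}x per error halving")
```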

Exercise

  • The very slow convergence in high dimensions does not necessarily mean that we will see bad results in high-dimensional anomaly detection with KDE,
  • especially if the anomalies are very far from the nominal data.
  • However, in cases where the contours of the nominal distribution are non-convex, we can run into problems.

We take a look at a higher dimensional version of our previous data set.

dists = create_distributions(dim=3)

distribution_with_anomalies = contamination(
    nominal=dists['Sinusoidal'],
    anomaly=dists['Blob'],
    p=0.01
)

sample = distribution_with_anomalies.sample(500)

y = sample[0]
X = sample[-1]
fig = px.scatter_3d(x=X[:, 0], y=X[:, 1], z=X[:, 2], color=y)
fig.show()
# Fit KDE on high dimensional examples 
rocs = []
auprs = []
bandwidths = []

param_space = {
        'kernel': ['gaussian'],
        'bandwidth': np.linspace(0.1, 100, 1000), # Define Search space for bandwidth parameter
    }

kdes = {}
dims = np.arange(2,16)
for d in tqdm(dims):
    # Generate d dimensional distributions
    dists = create_distributions(dim=d)

    distribution_with_anomalies = contamination(
        nominal=dists['Sinusoidal'],
        anomaly=dists['Blob'],
        p=0.01
    )

    # Train on clean data
    sample_train = dists['Sinusoidal'].sample(500)
    X_train = sample_train[-1].numpy()
    # Test data
    sample_test = distribution_with_anomalies.sample(500)
    X_test = sample_test[-1].numpy()
    y_test = sample_test[0].numpy()

    # Optimize bandwidth
    params, kde = hyperopt_by_score(X_train, param_space)
    kdes[d] = (params, kde)
    
    bandwidths.append(params['bandwidth'])

    test_scores = -kde.score_samples(X_test)
    test_scores = np.where(test_scores == np.inf, np.max(test_scores[np.isfinite(test_scores)])+1, test_scores)

    if y_test.any():  # record metrics when the test sample contains anomalies
        rocs.append(roc_auc_score(y_test, test_scores))
        auprs.append(average_precision_score(y_test, test_scores))
    
# Plot cross section of pdf 
fig, axes = plt.subplots(nrows=2, ncols=7, figsize=(15, 5))
for d, axis in tqdm(list(zip(kdes, axes.flatten()))):
    
    params, kde = kdes[d]

    lin = np.linspace(-10, 10, 50)
    grid_points = list(it.product(*([[0]]*(d-2)), lin, lin))
    ys, xs = np.meshgrid(lin, lin)
    # The score function of sklearn returns log-densities
    scores = np.exp(kde.score_samples(grid_points)).reshape(50, 50)
    colormesh = axis.contourf(xs, ys, scores)
    axis.set_title("Dim = {}".format(d))
    axis.set_aspect('equal')
    

# Plot evaluation
print('Cross-section of the KDE at (0,...,0, x, y)')
plt.show()
Cross-section of the KDE at (0,...,0, x, y)

Robustness

Another drawback of KDE in the context of anomaly detection is that it is not robust against contamination of the data.


Definition The breakdown point of an estimator is the smallest fraction of observations that need to be changed so that we can move the estimate arbitrarily far away from the true value.


Example: The sample mean has a breakdown point of $0$. Indeed, for a sample $x_1,\ldots, x_n$ we only need to change a single value in order to move the sample mean arbitrarily far. This means the breakdown point is smaller than $\frac{1}{n}$ for every $n\in\mathbb{N}$.
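A quick numerical illustration of the breakdown point: corrupting a single observation ruins the mean, while the median barely moves.

```python
import numpy as np

x = np.arange(10, dtype=float)   # 0, 1, ..., 9
corrupted = x.copy()
corrupted[0] = 1e9               # change one single observation

print(np.mean(x), np.mean(corrupted))      # 4.5 vs 1e8 + 4.5: ruined
print(np.median(x), np.median(corrupted))  # 4.5 vs 5.5: barely moved
```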

Robust Statistics

There are robust replacements for the sample mean:

  • Median of means: Split the dataset into $S$ equally sized subsets $X_1,\ldots, X_S$ and compute $\mathrm{median}(\overline{X_1},\ldots, \overline{X_S})$
  • M-estimation: The mean in a normed vector space is the value that minimizes the squared distances
    $\overline{X} = \arg\min_{y}\sum_{x\in X}\|x-y\|^2$
    M-estimation replaces the quadratic loss with a more robust loss function.
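The median-of-means construction above can be sketched in a few lines (`median_of_means` is an ad-hoc helper for this illustration):

```python
import numpy as np

def median_of_means(x, n_blocks=5, seed=0):
    """Shuffle, split into equally sized blocks, average each block, take the median."""
    x = np.random.default_rng(seed).permutation(x)
    return np.median([b.mean() for b in np.array_split(x, n_blocks)])

rng = np.random.default_rng(1)
x = rng.normal(5.0, 1.0, size=100)
x[:2] = 1e6  # two gross outliers

# At most 2 of the 5 blocks are contaminated, so the median block mean stays clean
print(x.mean(), median_of_means(x))
```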

Huber loss

Switch from quadratic to linear loss at prescribed threshold

import numpy as np


def huber(error: float, threshold: float):
    """Huber loss
    
    @param error: base error
    @param threshold: threshold for linear transition
    """
    test = (np.abs(error) <= threshold)
    return (test * (error**2)/2) + ((1-test)*threshold*(np.abs(error) - threshold/2))

x = np.linspace(-5, 5)
y = huber(x, 1)

plt.plot(x, y)
plt.gca().set_title("Huber Loss")
plt.show()

Hampel loss

More complex loss function that depends on 3 parameters $0 < a < b < r$.

import numpy as np

def single_point_hampel(error: float, a: float, b: float, r: float):
    """Hampel loss
    
    @param error: base error
    @param a: 1st threshold parameter
    @param b: 2nd threshold parameter
    @param r: 3rd threshold parameter
    """
    if abs(error) <= a:
        return error**2/2
    elif a < abs(error) <= b:
        return (1/2 *a**2 + a* (abs(error)-a))
    elif  b < abs(error) <= r:
        return a * (2*b-a+(abs(error)-b) * (1+ (r-abs(error))/(r-b)))/2
    else:
        return a*(b-a+r)/2

hampel = np.vectorize(single_point_hampel)

x = np.linspace(-10.1, 10.1)
y = hampel(x, a=1.5, b=3.5, r=8)

plt.plot(x, y)
plt.gca().set_title("Hampel Loss")
plt.show()

KDE is a Mean


Kernel as scalar product:

  • Let $K$ be a radial monotonic$^\ast$ kernel over $\mathbb{R}^n$.
  • For $x\in\mathbb{R}^n$ let $\phi_x = K(\cdot, x)$.
  • Vector space over the linear span of $\{\phi_x \mid x\in\mathbb{R}^n\}$:
    • Pointwise addition and scalar multiplication.
  • Define the scalar product $\langle \phi_x, \phi_y\rangle = K(x,y)$.
  • Advantage: Scalar product is computable
  • Call this the reproducing kernel Hilbert space (RKHS) of $K$.
  • $\mathrm{KDE}_h(\cdot, D) = \frac{1}{N}\sum_{i=1}^N K_h(\cdot, x_i) = \frac{1}{N}\sum_{i=1}^N\phi_{x_i}$
    • where $K_h(x,y) = \frac{1}{h^n}K\left(\frac{\|x-y\|}{h}\right)$

$^*$All kernels that we have seen are radial and monotonic
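Because the KDE is the mean of the feature maps $\phi_{x_i}$, robustifying it amounts to replacing that mean, which in practice yields a *weighted* KDE. A toy illustration of this idea (the pilot-density reweighting rule below is a simplified assumption for demonstration, not the rkde algorithm used later):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(95, 1)),   # nominal
                    rng.normal(8.0, 0.1, size=(5, 1))])   # 5% contamination

plain = KernelDensity(bandwidth=0.5).fit(X)
pilot = np.exp(plain.score_samples(X))
# Downweight points that the pilot estimate already finds implausible
w = np.minimum(1.0, pilot / np.median(pilot))
weighted = KernelDensity(bandwidth=0.5).fit(X, sample_weight=w)

# The contamination bump at x=8 shrinks under the weighted estimate
print(np.exp(plain.score_samples([[8.0]]))[0],
      np.exp(weighted.score_samples([[8.0]]))[0])
```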

Exercise

We compare the performance of different approaches to recover the nominal distribution under contamination. Here, we use code by Humbert et al. to replicate the results of the referenced paper on median-of-means KDE. More details on rKDE can be found in the paper by Kim and Scott.

# =======================================================
#   Parameters
# =======================================================
algos = [
    'kde',
    'mom-kde', # Median-of-Means
    'rkde-huber', # robust KDE with huber loss
    'rkde-hampel', # robust KDE with hampel loss
]

dataset = 'house-prices'
dataset_options = {'neighborhood': 'CollgCr', 'anomaly_neighborhood': 'Edwards'}

outlierprop_range = [0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]
kernel = 'gaussian'
auc_scores = perform_rkde_experiment(
    algos,
    dataset,
    dataset_options,
    outlierprop_range,
    kernel,
)
fig, ax = plt.subplots(figsize=(7, 5))
for algo, algo_data in auc_scores.groupby('algo'):
    x = algo_data.groupby('outlier_prop').mean().index
    y = algo_data.groupby('outlier_prop').mean()['auc_anomaly']
    ax.plot(x, y, 'o-', label=algo)
plt.legend()
plt.xlabel('outlier_prop')
plt.ylabel('auc_score')
plt.title('Comparison of rKDE against contamination')

Try using different neighborhoods for contamination. Which robust KDE algorithm performs better overall? Choose among the following options:

OldTown, Veenker, Edwards, MeadowV, Somerst, NPkVill, BrDale, Gilbert, NridgHt, Sawyer, Blmngtn, Blueste

You can also change the kernel type: gaussian, tophat, epanechnikov, exponential, linear, or cosine.

Summary

  • Kernel density estimation is a non-parametric method to estimate a pdf from a sample.
  • Bandwidth is the most important parameter.
  • Converges to the true pdf if $n\to\infty$.
    • The number of samples needed for a given accuracy grows exponentially with the dimension.
  • KDE is sensitive to contamination:
    • In a contaminated setting one can employ methods from robust statistics to obtain robust estimates.

Implementations

Anomaly Detection via Isolation

Idea: An anomaly should allow "simple" descriptions that distinguish it from the rest of the data.

  • Descriptions: Conjunction of single attribute tests, i.e. $X_i \leq c$ or $X_i > c$.
  • Example: $X_1 \leq 1.2 \text{ and } X_5 > -3.4 \text{ and } X_7 \leq 5.6$.
  • Complexity of description: Number of conjunctions.

Moreover, we assume that a short random description will have a significantly larger chance of isolating an anomaly than of isolating any nominal point.

  • Choose random isolating descriptions and compute anomaly score from average complexity.

Isolation Tree

Isolation Forest (iForest) implements this idea by generating an ensemble of random decision trees. Each tree is built as follows:


Input: Data set (subsample) $X$, maximal height $h$

  • Randomly choose feature $i$ and split value $s$ (in range of data)
  • Recursively build subtrees on $X_L = \{x\in X\mid x_i \leq s\}$ and $X_R = X\setminus X_L$
  • Stop if the remaining data set contains $\leq 1$ points or the maximal height is reached
  • Store test $x_i\leq s$ for inner nodes and $|X|$ for leaf nodes
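The recursion above can be sketched as follows (a minimal toy version for illustration, not the sklearn implementation; `build_itree` is an ad-hoc name):

```python
import numpy as np

def build_itree(X, height, max_height, rng):
    """Randomly split on one feature, recurse; stop at size <= 1 or max height."""
    if len(X) <= 1 or height >= max_height:
        return {'size': len(X)}                      # leaf stores |X|
    i = rng.integers(X.shape[1])                     # random feature
    lo, hi = X[:, i].min(), X[:, i].max()
    if lo == hi:                                     # cannot split a constant column
        return {'size': len(X)}
    s = rng.uniform(lo, hi)                          # random split value in data range
    return {'feature': i, 'split': s,                # inner node stores the test x_i <= s
            'left':  build_itree(X[X[:, i] <= s], height + 1, max_height, rng),
            'right': build_itree(X[X[:, i] > s],  height + 1, max_height, rng)}

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
tree = build_itree(X, 0, max_height=8, rng=rng)
```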

Visualization

Isolation Tree as Partition Diagram

Isolation Depth


Input: Observation $x$

  • ${\ell} = $ length of path from root to leaf according to tests
  • ${n} = $ size of remaining data set in leaf node
  • ${c(n)} =$ expected length of a path in a BST with $n$ nodes $={O}(\log n)$
  • ${h(x)} = \ell + c(n)$
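The expected BST path length $c(n)$ has a standard harmonic-number approximation, which turns a leaf's remaining sample size into an estimated extra depth:

```python
import numpy as np

def c(n):
    """Expected path length of an unsuccessful search in a BST with n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H_{n-1} ~ ln(n-1) + Euler-Mascheroni
    return 2 * harmonic - 2 * (n - 1) / n

# h(x) = path length from root to leaf + c(size of remaining data in the leaf)
ell, leaf_size = 7, 20
print(ell + c(leaf_size))
```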

Isolation Depth of Outlier (red) and nominal (blue)

Isolation Forest

  • Train $k$ isolation trees on subsamples of size $N$
Isolation depth of nominal point (left) and outlier (right)

Variants of Isolation Forest

Variant: Random Robust Cut Forest

New Rule to Choose Split Test:

  • $\ell_i$: length of the $i$th component of the bounding box around current data set
  • Choose dimension $i$ with probability $\frac{\ell_i}{\sum_j \ell_j}$
  • More robust against "noise dimensions"
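The split-dimension rule can be sketched as (`rrcf_split_dim` is an ad-hoc helper for illustration):

```python
import numpy as np

def rrcf_split_dim(X, rng):
    """Choose dimension i with probability l_i / sum_j l_j (bounding-box side lengths)."""
    lengths = X.max(axis=0) - X.min(axis=0)
    return rng.choice(len(lengths), p=lengths / lengths.sum())

rng = np.random.default_rng(0)
# Dimension 1 is a near-constant "noise dimension" and is almost never split on
X = np.column_stack([rng.uniform(0, 10, 500), rng.uniform(0, 0.01, 500)])
choices = [rrcf_split_dim(X, rng) for _ in range(1000)]
print(sum(c == 1 for c in choices))  # on the order of 1 in 1000
```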

Variant: Extended Isolation Forest

New split criterion:

  • Uniformly choose a normal and an orthogonal hyperplane through the data
  • Removes a bias that was empirically observed when plotting the outlier score of iForest on low dimensional data sets

Exercise: Network Security

In the final exercise of today you will have to develop an anomaly detection system for network traffic.

Briefing

A large e-commerce company A is experiencing downtime due to attacks on its infrastructure. You have been instructed to develop a system that can detect malicious connections to the infrastructure. It is planned that suspicious clients will be banned.

Another data science team already prepared the connection data of the last year for you. They also separated a test set and manually identified and labeled attacks in that data.

The Data

We will work on a version of the classic KDD99 data set.

Kddcup 99 Data Set


The KDD Cup '99 dataset was created by processing the tcp dump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by MIT Lincoln Lab [1]. The artificial data (described on the dataset's homepage: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) was generated using a closed network and hand-injected attacks to produce a large number of different types of attack with normal activity in the background.

=========================   ====================================================
Samples total               976158
Dimensionality              41
Features                    string (str), discrete (int), continuous (float)
Targets                     str, 'normal.' or name of the anomaly type
Proportion of Anomalies     1%
=========================   ====================================================


Task

You will have to develop the system on your own. In particular, you will have to

  • Explore the data.
  • Choose an algorithm.
  • Find a good detection threshold.
  • Evaluate and summarize your results.
  • Estimate how much A could save through the use of your system.
X_train,X_test,y_test = get_kdd_data()

Explore Data

#
# Add your exploration code
#
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
# get description
X_train.describe()
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
count 80524 80524 80524 80524 80524 80524 80524 80524 80524 80524 ... 80524 80524 80524.0 80524.0 80524.0 80524.0 80524.0 80524.0 80524.0 80524.0
unique 2027 3 51 10 3028 9749 2 3 2 19 ... 256 256 101.0 97.0 101.0 54.0 89.0 50.0 101.0 101.0
top 0 b'tcp' b'http' b'SF' 105 0 0 0 0 0 ... 255 255 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
freq 71098 62139 49451 75374 5928 13958 80522 80516 80523 80057 ... 34708 45956 52140.0 51505.0 27038.0 40645.0 76024.0 75584.0 72942.0 73012.0

4 rows × 41 columns

# get better description
X_train.drop(columns=[1,2,3]).astype(float).describe()
0 4 5 6 7 8 9 10 11 12 ... 31 32 33 34 35 36 37 38 39 40
count 80524.000000 8.052400e+04 8.052400e+04 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 ... 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000 80524.000000
mean 212.613432 1.233283e+03 3.335463e+03 0.000025 0.000273 0.000037 0.045018 0.000236 0.694774 0.034474 ... 152.213849 201.341637 0.840678 0.056100 0.154289 0.023343 0.009411 0.008416 0.057195 0.055349
std 1335.788840 3.889103e+04 3.920498e+04 0.004984 0.028191 0.010572 0.860065 0.023107 0.460506 4.447875 ... 103.504901 88.014482 0.311367 0.179141 0.307565 0.049512 0.089834 0.086773 0.224117 0.218304
min 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 1.460000e+02 1.050000e+02 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 40.000000 169.000000 0.910000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 2.320000e+02 3.920000e+02 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 ... 175.000000 255.000000 1.000000 0.000000 0.010000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 3.170000e+02 2.013000e+03 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 ... 255.000000 255.000000 1.000000 0.010000 0.090000 0.030000 0.000000 0.000000 0.000000 0.000000
max 30190.000000 5.135678e+06 5.134218e+06 1.000000 3.000000 3.000000 30.000000 4.000000 1.000000 884.000000 ... 255.000000 255.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 38 columns

# Check for NaNs
print("Number of NaNs: {}".format(X_train.isna().sum().sum()))
Number of NaNs: 0
#
# Add your preparation code here
#

# Encode string features
binarizer = LabelBinarizer()
one_hots = None
one_hots_test = None
for i in [1, 2, 3]:
    binarizer.fit(X_train[[i]].astype(str))
    if one_hots is None:
        one_hots = binarizer.transform(X_train[[i]].astype(str))
        one_hots_test = binarizer.transform(X_test[[i]].astype(str))
    else:
        one_hots = np.concatenate([one_hots, binarizer.transform(X_train[[i]].astype(str))], axis=1)
        one_hots_test = np.concatenate([one_hots_test, binarizer.transform(X_test[[i]].astype(str))], axis=1)

X_train.drop(columns=[1,2,3], inplace=True)
X_train_onehot = pd.DataFrame(np.concatenate([X_train.values, one_hots], axis=1))

X_test.drop(columns=[1,2,3], inplace=True)
X_test_onehot = pd.DataFrame(np.concatenate([X_test.values, one_hots_test], axis=1))
# Encode y
y_test_bin = np.where(y_test == b'normal.', 0, 1)
# Remove suspicious data
# This step is not strictly necessary but can improve performance
# 4-sigma rule: flag points more than 4 standard deviations from the column mean
suspicious = X_train_onehot.apply(lambda col: (col - col.mean()).abs() > 4 * col.std() if col.std() > 1 else False)
suspicious = suspicious.any(axis=1)
print('filtering {} suspicious data points'.format(suspicious.sum()))
X_train_clean = X_train_onehot[~suspicious]
filtering 2951 suspicious data points

Summary

  • Isolation Forest empirically shows very good performance up to relatively high dimensions
  • It is relatively robust against contamination
  • Usually little need for hyperparameter tuning

Implementations

Choose Algorithm

# TODO: implement proper model selection
iforest = IsolationForest()
iforest.fit(X_train_clean)
IsolationForest()
# find best threshold
X_test_onehot, X_val_onehot, y_test_bin, y_val_bin = train_test_split(X_test_onehot, y_test_bin, test_size=.5)
y_score = -iforest.score_samples(X_val_onehot).reshape(-1)

Evaluate Solution

#
# Insert evaluation code
#

# calculate scores if any anomaly is present
if np.any(y_val_bin == 1):
    eval = evaluate(y_val_bin, y_score)
    prec, rec, thr = eval['PR']
    # precision/recall have one more entry than thresholds; drop the last point
    f1s = 2 * (prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1])
    threshold = thr[np.argmax(f1s)]

    y_score = -iforest.score_samples(X_test_onehot).reshape(-1)
    y_pred = np.where(y_score < threshold, 0, 1)

    print('Precision: {}'.format(metrics.precision_score(y_test_bin, y_pred)))
    print('Recall: {}'.format(metrics.recall_score(y_test_bin, y_pred)))
    print('F1: {}'.format(metrics.f1_score(y_test_bin, y_pred)))
Precision: 0.8794520547945206
Recall: 0.9968944099378882
F1: 0.9344978165938865

Anomaly Detection via Reconstruction Error

Idea: Embed the data into a low-dimensional space and reconstruct it from the embedding. If the embedding is fitted to represent nominal data well, a high reconstruction error indicates an anomaly.

Autoencoder:

  • Parametric family of encoders: $f_\phi: \mathbb{R}^d \to \mathbb{R}^{\text{low}}$
  • Parametric family of decoders: $g_\theta: \mathbb{R}^{\text{low}} \to \mathbb{R}^{d}$
  • Reconstruction error of $(f_\phi, g_\theta)$ on $x$: $|x - g_\theta(f_\phi(x))|$
  • Given data set $D$, find $\phi,\theta$ that minimize $\sum_{x\in D} L(|x- g_\theta(f_\phi(x))|) $ for some loss function $L$.
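The definitions above can be sketched with a minimal dense autoencoder in Keras (all layer sizes and names here are illustrative choices, not a prescribed architecture):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

d, low = 784, 32  # illustrative input and bottleneck dimensions

# Encoder f_phi: R^d -> R^low
encoder = keras.Sequential([
    keras.Input(shape=(d,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(low),
])

# Decoder g_theta: R^low -> R^d
decoder = keras.Sequential([
    keras.Input(shape=(low,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(d, activation="sigmoid"),
])

# Chaining encoder and decoder gives the autoencoder; MSE plays the role of L
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# The squared reconstruction error of each point serves as its anomaly score
X = np.random.rand(16, d).astype("float32")
scores = np.square(X - autoencoder.predict(X)).sum(axis=1)
```

After fitting on (mostly nominal) data, scores corresponds to the reconstruction error $|x - g_\theta(f_\phi(x))|$ used as the anomaly score.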

Visualization

Autoencoder Schema

Neural Networks

Neural networks are very well suited for finding low dimensional representations of data. Hence they are a popular choice for the encoder and the decoder.


Artificial Neuron with $N$ inputs: $y = \sigma\left(\sum_{i=1}^N w_i x_i + b\right)$

  • $\sigma$: nonlinear activation function (applied component-wise).
  • $w_i$: weights
  • $b$: bias
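A single artificial neuron is just a weighted sum followed by a nonlinearity; a minimal numpy sketch (weights and inputs chosen arbitrarily for illustration):

```python
import numpy as np

def neuron(x, w, b):
    """Artificial neuron: y = sigma(sum_i w_i x_i + b) with a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([1.0, 2.0, -1.0])   # N = 3 inputs
w = np.array([0.5, -0.25, 1.0])  # weights
b = 0.1                          # bias
y = neuron(x, w, b)              # a scalar in (0, 1)
```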


Neural Networks

Neural networks combine many artificial neurons into a complex network. These networks are usually organized in layers, where the output of each layer is the input of the next. Some commonly used layers are:

  • Dense (fully connected) layers
  • Convolutional layers
  • Pooling layers

Variational Autoencoders

Variational autoencoders are an important extension of autoencoders that relates the reconstruction idea to density estimation. More precisely, we define a generative model for our data using latent variables and combine maximum likelihood estimation of the parameters with a simultaneous posterior estimation of the latents through amortized stochastic variational inference. We use a decoder network to transform the latent variables into the data distribution, and an encoder network to compute the posterior distribution of the latents given the data.


Definition: The model uses an observed variable $X$ (the data) and a latent variable $Z$ (the defining features of $X$). We assume both $P(Z)$ and $P(X\mid Z)$ to be normally distributed. More precisely

  • $P(Z) = \mathcal{N}(0, I)$
  • $P(X\mid Z) = \mathcal{N}(\mu_\phi(Z), I)$

where $\mu_\phi$ is a neural network parametrized by $\phi$. We use variational inference to perform posterior inference on $Z$ given $X$. We assume the distribution $P(Z\mid X)$ to be relatively well approximated by a Gaussian and use the posterior approximation:

  • $q(Z\mid X) = \mathcal{N}(\mu_\psi(X), \sigma_\psi(X))$

$\mu_\psi$ and $\sigma_\psi$ are neural networks parameterized by $\psi$.


Given a data set $D$ we minimize the (amortized) Kullback-Leibler divergence between our posterior approximation and the true posterior: \begin{align*} D_{KL}(q(z\mid x),p(z\mid x)) &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(z \mid x)}\right)\right] \\ &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{\frac{p(x \mid z)p(z)}{p(x)}}\right)\right] \\ &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(x \mid z)p(z)}\right) + \log(p(x))\right] \\ &= E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(x \mid z)p(z)}\right)\right] + E_{x\sim X}[\log(p(x))] \end{align*}

Now we can define

\begin{align*} \mathrm{ELBO}(q(z\mid x),p(z\mid x)) &:= E_{x\sim X}[\log(p(x))] - D_{KL}(q(z\mid x),p(z\mid x)) \\ &= -E_{x\sim X, z\sim q(Z\mid x)}\left[\log\left(\frac{q(z \mid x)}{p(x \mid z)p(z)}\right)\right] \end{align*}

Note that we can evaluate the expression inside the expectation of the final RHS of the equation, and we can obtain unbiased estimates of the expectation via sampling. Let us further try to understand the ELBO as an optimization objective. On one hand, maximizing the ELBO with respect to the parameters in $q$ is equivalent to minimizing the KL divergence between $p$ and $q$. On the other hand, maximizing the ELBO with respect to the parameters in $p$ can be understood as raising a lower bound for the likelihood of the generative model $p(x)$. Hence, the optimization tries to find an encoder and a decoder pair such that it simultaneously provides a good generative explanation of the data and a good approximation of the posterior distribution of the latent variables.
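To make the objective concrete, here is a small numerical sketch of such an unbiased negative-ELBO estimate for a single data point under the Gaussian assumptions above (the decode function and all dimensions are illustrative; the reparameterization $z = \mu + \sigma\epsilon$ is what makes this estimate differentiable in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_elbo_estimate(x, mu_q, log_var_q, decode, n_samples=64):
    """Monte Carlo estimate of -ELBO for one data point x (up to additive constants).

    With p(x|z) = N(decode(z), I) the log-likelihood term reduces to a squared
    error, and KL(q(z|x) || N(0, I)) has the closed form
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2).
    """
    std = np.exp(0.5 * log_var_q)
    # Reparameterization: z = mu + std * eps, eps ~ N(0, I)
    eps = rng.standard_normal((n_samples, mu_q.size))
    z = mu_q + std * eps
    recon = np.array([0.5 * np.sum((x - decode(zi)) ** 2) for zi in z])
    kl = 0.5 * np.sum(np.exp(log_var_q) + mu_q ** 2 - 1.0 - log_var_q)
    return recon.mean() + kl

# Toy check: identity decoder in a 2-dimensional latent/data space
x = np.array([0.5, -0.5])
val = neg_elbo_estimate(x, mu_q=np.zeros(2), log_var_q=np.zeros(2), decode=lambda z: z)
```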

Exercise

The MNIST Data Set

MNIST is one of the most iconic data sets in the history of machine learning. It contains 70000 samples of $28\times 28$ grayscale images of handwritten digits. Because of its moderate complexity and the ease of visualizing its samples, it is well suited to studying the behavior of machine learning algorithms in higher-dimensional spaces.

While originally created for classification (optical character recognition), we can build an anomaly detection data set by corrupting some of the images.

Pre-processing

We first need to obtain the MNIST data set and prepare an anomaly detection set from it. Note that the data set is in row-vector format. Therefore, we work with $28\times 28 = 784$ dimensional data points.

Load MNIST Data Set

mnist = get_mnist_data()

data = mnist['data']
print('data.shape: {}'.format(data.shape))
target = mnist['target'].astype(int)
data.shape: (70000, 784)

Build contaminated Data Sets

We prepared a function that does the job for us. It corrupts a prescribed portion of the data by introducing a rotation, noise or a blackout of some part of the image.
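The actual corruption logic lives in build_contaminated_minst; purely for intuition, one of the corruption types (a blackout) could be sketched as follows (a hypothetical illustration, not the provided implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def blackout(img, size=10):
    """Corrupt an image by zeroing a random square patch.

    Illustrative sketch of one corruption type; the provided
    build_contaminated_minst also applies rotations and noise.
    """
    img = img.copy()
    h, w = img.shape[:2]
    top = rng.integers(0, h - size)
    left = rng.integers(0, w - size)
    img[top:top + size, left:left + size] = 0.0
    return img

corrupted = blackout(np.ones((28, 28)))
```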

First, we need to transform the data into image format.

X = data.reshape(-1, 28, 28, 1)/255

Train/Test-Split

We will only corrupt the test set, hence we perform the train-test split beforehand. We separate a relatively small test set so that we can use as much of the data as possible to obtain high-quality representations.

test_size = .1
X_train, X_test, target_train, target_test = train_test_split(X, target, test_size=test_size)
X_test, y_test = build_contaminated_minst(X_test)

# Visualize contamination
anomalies = X_test[y_test != 0]
selection = np.random.choice(len(anomalies), 25)

fig, axes = plt.subplots(nrows=5, ncols=5, figsize=(5, 5))
for img, ax in zip(anomalies[selection], axes.flatten()):
    ax.imshow(img, 'gray')
    ax.axis('off')
plt.show()

Autoencoder

Let us finally train a variational autoencoder model. We replicate the model given in the Keras documentation and apply it to a synthetic outlier detection scenario based on MNIST.

In the vae package we provide the implementation of the VAE. Please take a look at the source code to see how the minimization of the KL divergence is implemented.
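As rough orientation before reading the source: a Keras-style VAE train step typically combines a reconstruction term with the closed-form Gaussian KL term. A sketch of the two terms (illustrative, not the exact code in the vae package):

```python
import tensorflow as tf

def vae_losses(x, z_mean, z_log_var, reconstruction):
    """Sketch of the two loss terms a Keras-style VAE train step combines.

    Illustrative only; see the vae package for the actual implementation.
    """
    # Reconstruction term: per-pixel binary cross-entropy, summed over the image
    reconstruction_loss = tf.reduce_mean(
        tf.reduce_sum(
            tf.keras.losses.binary_crossentropy(x, reconstruction), axis=(1, 2)
        )
    )
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    kl_loss = tf.reduce_mean(tf.reduce_sum(kl, axis=1))
    return reconstruction_loss + kl_loss, reconstruction_loss, kl_loss
```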

Create Model

latent_dim = 3
vae = VAE(decoder=build_decoder_mnist(latent_dim=latent_dim), encoder=build_encoder_minst(latent_dim=latent_dim))
## Inspect model architecture
vae.encoder.summary()
Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            [(None, 28, 28, 1)]  0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 14, 14, 32)   320         input_2[0][0]                    
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 7, 7, 64)     18496       conv2d[0][0]                     
__________________________________________________________________________________________________
flatten (Flatten)               (None, 3136)         0           conv2d_1[0][0]                   
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 16)           50192       flatten[0][0]                    
__________________________________________________________________________________________________
z_mean (Dense)                  (None, 3)            51          dense_1[0][0]                    
__________________________________________________________________________________________________
z_log_var (Dense)               (None, 3)            51          dense_1[0][0]                    
__________________________________________________________________________________________________
sampling (Sampling)             (None, 3)            0           z_mean[0][0]                     
                                                                 z_log_var[0][0]                  
==================================================================================================
Total params: 69,110
Trainable params: 69,110
Non-trainable params: 0
__________________________________________________________________________________________________
## Inspect model architecture
vae.decoder.summary()
Model: "decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 3)]               0         
_________________________________________________________________
dense (Dense)                (None, 3136)              12544     
_________________________________________________________________
reshape (Reshape)            (None, 7, 7, 64)          0         
_________________________________________________________________
conv2d_transpose (Conv2DTran (None, 14, 14, 64)        36928     
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 28, 28, 32)        18464     
_________________________________________________________________
conv2d_transpose_2 (Conv2DTr (None, 28, 28, 1)         289       
=================================================================
Total params: 68,225
Trainable params: 68,225
Non-trainable params: 0
_________________________________________________________________
# train model
n_epochs = 30

vae.compile(optimizer=keras.optimizers.Adam(learning_rate=.001))
history = vae.fit(X_train, epochs=n_epochs, batch_size=128)
2023-04-19 10:31:23.055180: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/30
493/493 [==============================] - 46s 91ms/step - loss: 43.9802 - reconstruction_loss: 36.0458 - kl_loss: 0.3257
Epoch 2/30
493/493 [==============================] - 48s 98ms/step - loss: 31.6917 - reconstruction_loss: 29.0970 - kl_loss: 1.9113
Epoch 3/30
493/493 [==============================] - 47s 96ms/step - loss: 30.1662 - reconstruction_loss: 27.5706 - kl_loss: 2.4761
Epoch 4/30
493/493 [==============================] - 47s 96ms/step - loss: 29.7152 - reconstruction_loss: 26.5361 - kl_loss: 2.9571
Epoch 5/30
493/493 [==============================] - 47s 96ms/step - loss: 28.9498 - reconstruction_loss: 25.2538 - kl_loss: 3.5752
Epoch 6/30
493/493 [==============================] - 47s 95ms/step - loss: 28.5414 - reconstruction_loss: 24.6374 - kl_loss: 3.8247
Epoch 7/30
493/493 [==============================] - 51s 104ms/step - loss: 28.3047 - reconstruction_loss: 24.3351 - kl_loss: 3.9369
Epoch 8/30
493/493 [==============================] - 52s 105ms/step - loss: 28.1957 - reconstruction_loss: 24.1238 - kl_loss: 4.0250
Epoch 9/30
493/493 [==============================] - 52s 106ms/step - loss: 28.0770 - reconstruction_loss: 23.9851 - kl_loss: 4.0686
Epoch 10/30
493/493 [==============================] - 54s 109ms/step - loss: 28.0162 - reconstruction_loss: 23.8709 - kl_loss: 4.1166
Epoch 11/30
493/493 [==============================] - 54s 110ms/step - loss: 27.9992 - reconstruction_loss: 23.7733 - kl_loss: 4.1561
Epoch 12/30
493/493 [==============================] - 54s 110ms/step - loss: 27.9469 - reconstruction_loss: 23.6861 - kl_loss: 4.1964
Epoch 13/30
493/493 [==============================] - 56s 113ms/step - loss: 27.9455 - reconstruction_loss: 23.6237 - kl_loss: 4.2143
Epoch 14/30
493/493 [==============================] - 59s 119ms/step - loss: 27.8618 - reconstruction_loss: 23.5419 - kl_loss: 4.2608
Epoch 15/30
493/493 [==============================] - 59s 120ms/step - loss: 27.7890 - reconstruction_loss: 23.4880 - kl_loss: 4.2745
Epoch 16/30
493/493 [==============================] - 59s 120ms/step - loss: 27.7563 - reconstruction_loss: 23.4388 - kl_loss: 4.3066
Epoch 17/30
493/493 [==============================] - 58s 118ms/step - loss: 27.6718 - reconstruction_loss: 23.3834 - kl_loss: 4.3206
Epoch 18/30
493/493 [==============================] - 57s 117ms/step - loss: 27.6985 - reconstruction_loss: 23.3397 - kl_loss: 4.3462
Epoch 19/30
493/493 [==============================] - 58s 118ms/step - loss: 27.7143 - reconstruction_loss: 23.3263 - kl_loss: 4.3451
Epoch 20/30
493/493 [==============================] - 58s 117ms/step - loss: 27.6340 - reconstruction_loss: 23.2739 - kl_loss: 4.3725
Epoch 21/30
493/493 [==============================] - 58s 117ms/step - loss: 27.6250 - reconstruction_loss: 23.2324 - kl_loss: 4.3844
Epoch 22/30
493/493 [==============================] - 58s 117ms/step - loss: 27.6293 - reconstruction_loss: 23.1890 - kl_loss: 4.4027
Epoch 23/30
493/493 [==============================] - 58s 117ms/step - loss: 27.6002 - reconstruction_loss: 23.1515 - kl_loss: 4.4319
Epoch 24/30
493/493 [==============================] - 58s 117ms/step - loss: 27.6394 - reconstruction_loss: 23.1220 - kl_loss: 4.4499
Epoch 25/30
493/493 [==============================] - 53s 108ms/step - loss: 27.5510 - reconstruction_loss: 23.1140 - kl_loss: 4.4538
Epoch 26/30
493/493 [==============================] - 34s 70ms/step - loss: 27.5861 - reconstruction_loss: 23.0813 - kl_loss: 4.4622
Epoch 27/30
493/493 [==============================] - 34s 70ms/step - loss: 27.5400 - reconstruction_loss: 23.0504 - kl_loss: 4.4710
Epoch 28/30
493/493 [==============================] - 34s 69ms/step - loss: 27.5267 - reconstruction_loss: 23.0238 - kl_loss: 4.4779
Epoch 29/30
493/493 [==============================] - 34s 70ms/step - loss: 27.5144 - reconstruction_loss: 23.0092 - kl_loss: 4.4840
Epoch 30/30
493/493 [==============================] - 34s 70ms/step - loss: 27.5164 - reconstruction_loss: 22.9949 - kl_loss: 4.5029

Inspect Result

import matplotlib.pyplot as plt


def plot_latent_space(vae: VAE, n: int=10, figsize: float=10):
    """Plot sample images from 2D slices of latent space
    
    @param vae: vae model
    @param n: sample nXn images per slice
    @param figsize: figure size
    
    """
    for perm in [[0, 1, 2], [1, 2, 0], [2, 1, 0]]:
        # display a n*n 2D manifold of digits
        digit_size = 28
        scale = 1.0
        figure = np.zeros((digit_size * n, digit_size * n))
        # linearly spaced coordinates corresponding to the 2D plot
        # of digit classes in the latent space
        grid_x = np.linspace(-scale, scale, n)
        grid_y = np.linspace(-scale, scale, n)[::-1]

        for i, yi in enumerate(grid_y):
            for j, xi in enumerate(grid_x):
                z_sample = np.array([[xi, yi, 0]])
                z_sample[0] = z_sample[0][perm]
                x_decoded = vae.decoder.predict(z_sample)
                digit = x_decoded[0].reshape(digit_size, digit_size)
                figure[
                    i * digit_size : (i + 1) * digit_size,
                    j * digit_size : (j + 1) * digit_size,
                ] = digit

        plt.figure(figsize=(figsize, figsize))
        start_range = digit_size // 2
        end_range = n * digit_size + start_range
        pixel_range = np.arange(start_range, end_range, digit_size)
        sample_range_x = np.round(grid_x, 1)
        sample_range_y = np.round(grid_y, 1)
        plt.xticks(pixel_range, sample_range_x)
        plt.yticks(pixel_range, sample_range_y)
        plt.xlabel("z[{}]".format(perm[0]))
        plt.ylabel("z[{}]".format(perm[1]))
        plt.gca().set_title('z[{}] = 0'.format(perm[2]))
        plt.imshow(figure, cmap="Greys_r")
        plt.show()
plot_latent_space(vae)
# Principal components
pca = PCA()
latents = vae.encoder.predict(X_train)[2]
pca.fit(latents)

kwargs = {'x_{}'.format(i): (-1., 1.) for i in range(latent_dim)}


@widgets.interact(**kwargs)
def explore_latent_space(**kwargs):
    """Widget to explore latent space from given start position
    """
    center_img = pca.transform(np.zeros([1,latent_dim]))

    latent_rep_pca =  center_img + np.array([[kwargs[key] for key in kwargs]])
    latent_rep = pca.inverse_transform(latent_rep_pca)
    img = vae.decoder(latent_rep).numpy().reshape(28, 28)

    fig, ax = plt.subplots()
    ax.axis('off')

    ax.imshow(img,cmap='gray', vmin=0, vmax=1)
    plt.show()
latents = vae.encoder.predict(X_train)[2]
scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=target_train)

scatter.show()
latents = vae.encoder.predict(X_test)[2]
scatter = px.scatter_3d(x=latents[:, 0], y=latents[:, 1], z=latents[:, 2], color=y_test)

scatter.show()
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test)
n_samples = 10

s = np.random.choice(range(len(X_val)), n_samples)
s = X_val[s]

fig, axes = plt.subplots(nrows=2, ncols=n_samples, figsize=(10, 2))
for img, ax_row in zip(s, axes.T):
    x = vae.decoder.predict(vae.encoder.predict(img.reshape(1, 28, 28, 1))[2]).reshape(28, 28)
    diff = x - img.reshape(28, 28)
    error = (diff * diff).sum()
    ax_row[0].axis('off')
    ax_row[1].axis('off')
    ax_row[0].imshow(img,cmap='gray', vmin=0, vmax=1)
    ax_row[1].imshow(x, cmap='gray', vmin=0, vmax=1)
    ax_row[1].set_title('E={:.1f}'.format(error))

plt.tight_layout()
plt.show()
from sklearn import metrics
y_test_bin = y_test.copy()
y_test_bin[y_test != 0] = 1
y_val_bin = y_val.copy()
y_val_bin[y_val != 0] = 1
# Evaluate
reconstruction = vae.decoder.predict(vae.encoder(X_val)[2])
rerrors = (reconstruction - X_val).reshape(-1, 28*28)
rerrors = (rerrors * rerrors).sum(axis=1)

# Let's calculate scores if any anomaly is present
if np.any(y_val_bin == 1):
    eval = evaluate(y_val_bin.astype(int), rerrors.astype(float))
    pr, rec, thr = eval['PR']
    f1s = (2 * ((pr * rec)[:-1]/(pr + rec)[:-1]))
    threshold = thr[np.argmax(f1s)]
    print('Optimal threshold: {}'.format(threshold))

    reconstruction = vae.decoder.predict(vae.encoder(X_test)[2])
    reconstruction_error = (reconstruction - X_test).reshape(-1, 28*28)
    reconstruction_error = (reconstruction_error * reconstruction_error).sum(axis=1)


    classification = (reconstruction_error > threshold).astype(int)

    print('Precision: {}'.format(metrics.precision_score(y_test_bin, classification)))
    print('Recall: {}'.format(metrics.recall_score(y_test_bin, classification)))
    print('F1: {}'.format(metrics.f1_score(y_test_bin, classification)))

    metrics.confusion_matrix(y_test_bin, classification)
else:
    reconstruction_error = None
Optimal threshold: 90.29661704732457
Precision: 0.78
Recall: 0.4020618556701031
F1: 0.5306122448979592

Sort Data by Reconstruction Error

if reconstruction_error is not None:
    combined = list(zip(X_test, reconstruction_error))
    combined.sort(key = lambda x: x[1])

Show Top Autoencoder Outliers

if reconstruction_error is not None:
    n_rows = 10
    n_cols = 10
    n_samples = n_rows*n_cols

    samples = [c[0] for c in combined[-n_samples:]]

    fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(2*n_cols, 2*n_rows))
    for img, ax in zip(samples, axes.reshape(-1)):
        ax.axis('off')
        ax.imshow(img.reshape((28,28)), cmap='gray', vmin=0, vmax=1)

    plt.show()

Summary

  • Autoencoders are the most prominent reconstruction-error-based anomaly detection method.
  • Can provide high-quality results on high-dimensional data.
  • Architecture is highly adaptable to the data (fully connected, CNN, attention, ...).
  • Sensitive to contamination.
  • Variational autoencoders are an important variant that improves the interpretability of the latent space.

Implementations
